A Scalable Approach to Harvest Modern Weblogs

Authors

  • Vangelis Banos
  • Olivier Blanvillain
  • Nikos Kasioumis
  • Yannis Manolopoulos
Abstract

Blogs are one of the most prominent means of communication on the web. Their content, interconnections and influence constitute a unique socio-technical artefact of our times which needs to be preserved. The BlogForever project has established best practices and developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents the latest developments of the blog crawler which is a key component of the BlogForever platform. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple yet robust and scalable algorithm to generate extraction rules based on string matching using the blog’s web feed in conjunction with blog hypertext. Furthermore, we present a system architecture which is characterised by efficiency, modularity, scalability and interoperability with third-party systems. Finally, we conduct thorough evaluations of the performance and accuracy of our system.
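The rule-generation idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the BlogForever implementation: the abstract only says that rules are generated by string matching between the web feed and the blog hypertext, so the similarity measure (Python's `difflib.SequenceMatcher`), the rule format (a tag/class path) and the names `generate_rule` and `apply_rule` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag; they are ignored when building paths.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class _TextByPath(HTMLParser):
    """Collect the text found directly inside each element path."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, outermost first
        self.texts = {}   # path string -> text seen directly under that path

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if self.stack and text:
            path = " > ".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def generate_rule(feed_value, html):
    """Return (path, score) of the element whose text best matches feed_value.

    Hypothetical helper: the similarity measure is an assumption, not the
    measure used by the authors.
    """
    parser = _TextByPath()
    parser.feed(html)
    best_path, best_score = None, 0.0
    for path, text in parser.texts.items():
        score = SequenceMatcher(None, feed_value.lower(), text.lower()).ratio()
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score


def apply_rule(path, html):
    """Extract the text at a previously generated path from another page."""
    parser = _TextByPath()
    parser.feed(html)
    return parser.texts.get(path)


if __name__ == "__main__":
    # The feed supplies a known title; the matching element becomes the rule.
    page = ('<html><body><div class="nav">Home About</div>'
            '<h1 class="title">Harvesting Blogs</h1>'
            '<div class="post"><p>Some long article text here.</p></div>'
            '</body></html>')
    rule, score = generate_rule("Harvesting Blogs", page)
    print(rule, score)
```

In this sketch a rule learned from one post whose title is known from the feed can then be applied to other pages of the same blog that share the template, which is the reason for combining the feed with the page markup.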

Similar resources

The Role of Weblogs in Iranian EFL Learners’ Vocabulary Knowledge and Writing Ability

Nowadays Information and Communications Technology (ICT) is becoming not only enormously popular but also increasingly important in our lives and education system. Generally, learners are usually eager to work on computers or with various kinds of modern technology. This research was carried out to find out whether using weblogs is effective in Iranian EFL learners’ vocabulary and writing skill...


Dynamic configuration and collaborative scheduling in supply chains based on scalable multi-agent architecture

Due to diversified and frequently changing demands from customers, technological advances and global competition, manufacturers rely on collaboration with their business partners to share costs, risks and expertise. How to take advantage of advancement of technologies to effectively support operations and create competitive advantage is critical for manufacturers to survive. To respond to these...


Intelligent scalable image watermarking robust against progressive DWT-based compression using genetic algorithms

Image watermarking refers to the process of embedding an authentication message, called watermark, into the host image to uniquely identify the ownership. In this paper a novel, intelligent, scalable, robust wavelet-based watermarking approach is proposed. The proposed approach employs a genetic algorithm to find nearly optimal positions to insert watermark. The embedding positions coded as chr...


Scalable Discovery of Contradicting Opinions in Weblogs

Weblogs are a popular means of information communication, where people discuss a variety of topics, and often times also express their opinions on these topics. In this work, we address the problem of analyzing the evolution of community opinions across time, as these are represented in the weblogs. In particular, we are interested in identifying topics and time windows, for which contradictory...



Journal:
  • International Journal on Artificial Intelligence Tools

Volume 24  Issue 

Pages  -

Publication date: 2015